Abstract

We propose a simple and efficient algorithm for learning sparse invariant representations
from unlabeled data with fast inference. When trained on short movie
sequences, the learned features are selective to a range of orientations and spatial
frequencies, but robust to a wide range of positions, similar to complex cells in the
primary visual cortex. We give a hierarchical version of the algorithm, and give
guarantees of fast convergence under certain conditions.
1 Introduction

Learning representations that are invariant to irrelevant transformations of the input is an important
step towards building recognition systems automatically. Invariance is a key property of some cells
in the mammalian visual cortex. Cells in high-level areas of the visual cortex respond to object
categories, and are invariant to a wide range of variations of the object (pose, illumination,
conformation, instance, etc.). The simplest known example of invariant representations in visual cortex are
the complex cells of V1 that respond to edges of a given orientation but are activated by a wide range
of positions of the edge. Many artificial object recognition systems have built-in invariances, such
as the translational invariance of convolutional networks [1], or SIFT descriptors [2]. An important
question is how useful invariant representations of the visual world can be learned from unlabeled
samples.



In this paper we introduce an algorithm for learning features that are invariant (or robust) to common
image transformations that typically occur between successive frames of a video or statistically
within a single frame. While the method is quite simple, it is also computationally efficient, and
possesses provable bounds on the speed of inference. The first component of the model is a layer
of sparse coding. Sparse coding [3] constructs a dictionary matrix $W$ so that input vectors can
be represented by a linear combination of a small number of columns of the dictionary matrix.
Inference of the feature vector $z$ representing an input vector $x$ is performed by finding the $z$ that
minimizes the following energy function

$$E_1(W, x, z) = \frac{1}{2}\|x - Wz\|^2 + \alpha |z|_1 \tag{1}$$

where $\alpha$ is a positive constant. The dictionary matrix $W$ is learned by minimizing $\min_z E_1(W, x^k, z)$
averaged over a set of training samples $x^k$, $k = 1 \ldots K$, while constraining the columns of $W$ to
have norm 1.
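
As a minimal sketch of the inference step for energy (1), the code below minimizes the energy over $z$ with plain iterative shrinkage-thresholding (ISTA). The function names, the random dictionary, the problem sizes and the iteration count are illustrative assumptions, not part of the paper; the paper itself uses the faster FISTA variant discussed in Section 3.1.

```python
import numpy as np

def soft_threshold(v, thresh):
    # Shrinkage: sign(v) * max(|v| - thresh, 0), applied elementwise.
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def sparse_coding_energy(W, x, z, alpha):
    # Energy (1): 0.5 * ||x - W z||^2 + alpha * |z|_1
    residual = x - W @ z
    return 0.5 * residual @ residual + alpha * np.abs(z).sum()

def infer_code_ista(W, x, alpha, n_iter=200):
    # Step size 1/L, with L the Lipschitz constant of the gradient of the
    # quadratic term (squared spectral norm of W).
    L = np.linalg.norm(W, 2) ** 2
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ z - x)
        z = soft_threshold(z - grad / L, alpha / L)
    return z

# Hypothetical toy problem: 16-dimensional inputs, 32 code units.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
W /= np.linalg.norm(W, axis=0, keepdims=True)  # columns constrained to norm 1
x = rng.standard_normal(16)
z = infer_code_ista(W, x, alpha=0.5)
```

With step size $1/L$ each ISTA iteration cannot increase the energy, so the returned code has energy no larger than the all-zero starting point.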

The first idea of the proposed method is to accumulate sparse feature vectors representing successive
frames in a video, or versions of an image that are distorted by transformations that do not affect the
nature of its content:

$$z^* = \sum_t \left| \arg\min_z \frac{1}{2}\|x_t - Wz\|^2 + \alpha |z|_1 \right| \tag{2}$$

where the sum runs over the distorted images $x_t$. The second idea is to connect a second sparse
coding layer on top of the first one that will capture dependencies between components of the
accumulated sparse code vector. This second layer models the vector $z^*$ using an invariant code $u$, which is
the minimum of the following energy function



where $|u|_1$ denotes the L1 norm of $u$, $A$ is a matrix, and $\beta$ is a positive constant controlling the
sparsity of $u$. Unlike with traditional sparse coding, in this method the dictionary matrix interacts
multiplicatively with the input $z^*$. As in traditional sparse coding, the matrix $A$ is trained by gradient
descent to minimize the average energy for the optimal $u$ over a training set of vectors $z^*$ obtained
as stated above. The columns of $A$ are constrained to be normalized to 1. Essentially, the matrix
$A$ will connect a component of $u$ to a set of components of $z^*$ if these components of $z^*$ co-occur
frequently. When a component of $u$ turns on, it has the effect of lowering the coefficients of the
components of $z^*$ to which it is strongly connected through the $A$ matrix. To put it another way,
if a set of components of $z^*$ often turn on together, the matrix $A$ will connect them to a component
of $u$. Turning on this component of $u$ will lower the overall energy (if $\beta$ is small enough) because
the whole set of components of $z^*$ will see their coefficients lowered (the exponential terms).
Hence, each unit of $u$ will connect units of $z^*$ that often turn on together within a sequence of images.
These units will typically represent distorted versions of a feature.

The energies (1) and (3) can be naturally combined into a single model of $z$ and $u$, as
explained in Section 2. There the second layer $u$ essentially modulates the sparsity of the first layer
$z$. A single model of the image is more natural. For the invariance properties we did not find much
qualitative difference between the two, and since the separately trained model has provable inference
bounds, we present the results for separate training. However, a two-layer model should capture the
statistics of an image better. To demonstrate this we compared the inpainting capability of one- and
two-layer models and found that the two-layer model does a better job. For these experiments, the
combined two-layer model is necessary. We also found that although the assumptions of the fast
inference are not satisfied for the two-layer model, empirically the inference is fast in this case as well.

1.1 Prior work on Invariant Feature Learning 

The first way to implement invariance is to take a known invariance, such as translational invariance
in images, and put it directly into the architecture. This has been highly successful in convolutional
neural networks [1] and SIFT descriptors [2] and their derivatives. The major drawback of this
approach is that it works for known invariances, but not for unknown invariances such as invariance to
the instance of an object. A system that discovers invariance on its own would be desirable.

The second type of invariance implementation is considered in the framework of sparse coding or
independent component analysis. The idea is to change a cost function on hidden units in a way that
would prefer co-occurring units to be close together in some space [4, 5]. This is achieved by pooling
units that are close in space. This groups different inputs together, producing a form of invariance.
The drawback of this approach is that it requires some sort of embedding in space and that the filters
have to arrange themselves.

In the third approach, rather than forcing units to arrange themselves, we let them learn whatever
representations they want to learn and instead figure out which to pool together. In [6, 7], this was
achieved by modulating the covariance of the simple units with complex units.

The fourth approach to invariance uses the following idea: if the inputs follow one another in time,
they are likely a consequence of the same cause. We would like to discover that cause and therefore
look for representations that are common to all frames. This was achieved in several ways. In
slow feature analysis [8, 9, 10] one forces the representation to change slowly. In the temporal product
network [11] one breaks the input into two representations, one that is common to all frames and one
that is complementary. In [12] the idea is similar, but in addition the complementary representation
specifies movement. In the simplest instance of hierarchical temporal memory [13] one forms groups
based on the transition matrix between states. The model of [14] is a structured model of video.




Many of the approaches for learning invariance are inspired by the fact that the neocortex learns to
create invariant representations. Consequently these approaches are not focused on creating efficient
algorithms. In this paper, we give an efficient learning algorithm that falls into the framework
of the third and fourth approaches. The basic idea is to modulate the sparsity of sparse coding units
using higher-level units that are also sparse. The fourth approach is implemented by using the
same higher-level representation for several consecutive time frames. In form, our model is
similar to that of [7] but a little simpler. In a sense, comparing our model to [6, 7] is similar to
comparing sparse coding to independent component analysis. Independent component analysis is
a probabilistic model, whereas sparse coding attempts to reconstruct the input in terms of a few active
hidden units. The advantage of sparse coding is that it is simpler and easier to optimize. There
exist several very efficient inference and learning algorithms [15, 16, 17, 18, 19], and sparse coding
has been applied to a large number of problems. It is this simplicity that allows efficient training of
our model. The inference algorithm is closely derived from the fast iterative shrinkage-thresholding
algorithm (FISTA) [15] and has a convergence rate of $1/k^2$, where $k$ is the number of iterations.

2 The Model 

The model described above comprises two separately trained modules, whose inference is performed
separately. However, one can devise a unified model with a single energy function that is
conceptually simpler:

$$E(W, A, x, z, u) = \sum_t \frac{1}{2}\|x_t - W z_t\|^2 + \sum_i \sum_t \alpha |z_{t,i}|\, g(u)_i + h(u) \tag{5}$$

$$g(u)_i = \left(1 + e^{-(Au)_i}\right)/2, \qquad h(u) = \beta |u|_1$$

Given a set of inputs $x^k$, the goal of training is to minimize $\sum_k \min_{z^k, u^k} E(W, A, x^k, z^k, u^k)$.
We do this by choosing one input $x$ at a time, minimizing (5) over $z$ and $u$ with $W$ and $A$ fixed, then
fixing the resulting $z_{\min}, u_{\min}$ and taking a step in the negative gradient direction of $W$, $A$
(stochastic gradient descent). An algorithm for finding $z_{\min}, u_{\min}$ is given in Section 4. It consists
of taking steps in $z$ and $u$ separately, each of which lowers the energy.

Note: The function $g$ in (5) is different from that of the simple (split) model. The reason is that, in
our experiments with the split-model $g$ in the unified energy, either the $u$ units lower the sparsity of
$z$ too much, not resulting in a sparse $z$ code, or the units $u$ do not turn on at all.
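
The unified energy can be written down directly in code. The following is a small sketch under stated assumptions: frames are stacked as rows of `X` and codes as rows of `Z`, and the function names, array shapes and random test data are hypothetical choices made only for illustration.

```python
import numpy as np

def g(A, u):
    # Modulation function of the unified model: g(u)_i = (1 + exp(-(A u)_i)) / 2
    return 0.5 * (1.0 + np.exp(-(A @ u)))

def unified_energy(W, A, X, Z, u, alpha, beta):
    # X: (n_t, n_pixels) frames, Z: (n_t, n_simple) codes, one row per frame.
    # W: (n_pixels, n_simple), A: (n_simple, n_invariant).
    reconstruction = 0.5 * np.sum((X - Z @ W.T) ** 2)
    # g(A, u) has shape (n_simple,) and broadcasts over the frame axis.
    sparsity = alpha * np.sum(np.abs(Z) * g(A, u))
    return reconstruction + sparsity + beta * np.abs(u).sum()

# Hypothetical small instance with n_t = 2 frames.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
A = np.abs(rng.standard_normal((4, 3)))  # nonnegative entries, as in Section 3.1
X = rng.standard_normal((2, 6))
Z = rng.standard_normal((2, 4))
e_off = unified_energy(W, A, X, Z, np.zeros(3), alpha=0.5, beta=0.3)
```

With $u = 0$ the modulation is $g = 1$ and the energy reduces to the sum of the sparse coding energies (1) over frames, which is a convenient sanity check.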

2.1 A Toy Example 

We now describe a toy example that illustrates the main idea of the model [20]. The input, with $n_t = 1$,
is an image patch consisting of a subset of the set of parallel lines of four different orientations
and ten different positions per orientation. However, for any given input, only lines with the same
orientation can be present (Figure 1a): different orientations have equal probability, and for a given
orientation each line of this orientation is present with probability 0.2 independently of the others. This
is a toy example of a texture. Training sparse coding on this input results in filters similar to the ones
in Figure 1b. We see that a given simple unit responds to one particular line. The noisy filters
correspond to simple units that are inactive. This happens because there are only 40 discrete inputs.
In realistic data such as natural images, we have a continuum and typically all units are used.

Clearly, sparse coding cannot capture all the statistics present in the data. The simple units are not
independent. We would like to learn that units corresponding to lines of a given orientation
usually turn on simultaneously. We trained (5) on this data, resulting in the filters in Figure 1b,c.
The filters of the simple units of this full model are similar to those obtained by training just the
sparse coding. The invariant units pool together simple units with filters corresponding to lines of
the same orientation. This makes the invariant units invariant to the pattern of lines and dependent
only on the orientation. Only four invariant units were active, corresponding to the four groups. As
in sparse coding, on realistic data such as natural images, all invariant units become active and
distribute themselves with overlapping filters, as we will see below.
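
The toy input distribution can be sketched as follows. The patch size of 10 x 10 pixels, the particular four orientations (horizontal, vertical and two wrap-around diagonals) and all function names are assumptions made for illustration; the text only specifies four orientations, ten positions per orientation, and a line probability of 0.2.

```python
import numpy as np

SIZE = 10  # assumed patch side; gives ten positions per orientation

def line_mask(orientation, position, size=SIZE):
    # One line of the given orientation (0..3) at the given position (0..size-1).
    img = np.zeros((size, size))
    idx = np.arange(size)
    if orientation == 0:      # horizontal
        img[position, :] = 1.0
    elif orientation == 1:    # vertical
        img[:, position] = 1.0
    elif orientation == 2:    # diagonal, wrapping around the patch border
        img[idx, (idx + position) % size] = 1.0
    else:                     # anti-diagonal, wrapping around the patch border
        img[idx, (position - idx) % size] = 1.0
    return img

def sample_patch(rng, p_line=0.2, size=SIZE):
    # Pick one orientation uniformly, then include each of its lines
    # independently with probability p_line.
    orientation = int(rng.integers(4))
    present = rng.random(size) < p_line
    patch = np.zeros((size, size))
    for position in np.flatnonzero(present):
        patch = np.maximum(patch, line_mask(orientation, position, size))
    return patch, orientation, present

rng = np.random.default_rng(0)
patch, orientation, present = sample_patch(rng)
```

Because all lines in a patch share one orientation, the simple units that respond to them co-occur, which is exactly the statistic the invariant units are meant to pick up.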

Figure 1: Toy example trained using model (5). Panels from left to right: input, simple units,
invariant units. a) Randomly selected input image patches. The
input patches are generated as follows. Pick one of the four orientations at random. Consider all
lines of this orientation. Put any such line into the image independently with probability 0.2. b)
Learned sparse coding filters. A given active unit responds to a particular line. c) Learned filters of
the invariant units. Each row corresponds to an invariant unit. The sparse coding filters are ordered
according to the strength of their connection to the invariant unit. There are only four active units
(arrows) and each responds to a given orientation, invariant to which lines of a given orientation are
present.

Let us now discuss the motivation behind introducing a sequence of inputs ($n_t > 1$) in (5). Inputs
that follow one another in time are usually a consequence of the same cause. We would like to
discover that cause. This cause is something that is present in all frames, and therefore we are
looking for a single representation $u$ in (5) that is common to all the frames.

Another interesting point about the model (5) is that a nonzero $u$ lowers the sparsity coefficient of
the units of $z$ that belong to a group, making them more likely to become activated. This means that the
model can utilize higher-level information (which group is present) to modulate the activity of the
lower layer. This is a desirable property for multi-layer systems because different parts of the system
should propagate their beliefs to other parts. In our invariance experiments the results for the unified
model were very similar to the results of the simple (split) model. Below we show the results of this
simple model because it is simple and because it has a provably efficient inference algorithm.
However, in Section 4 we will revisit the full system, generalize it to an $n$-layer system, give an
algorithm for training it, and prove that under some assumptions of convexity, the algorithm again
has provably efficient inference. In the final section we use the full system for inpainting and show
that it generalizes better than a single-layer system.

3 Efficient Inference for the Simplified Model 

Here we discuss how to find $u$ efficiently and give the numerical results of the paper. The results for
the full model (5) were similar.

3.1 FISTA training algorithm 

The advantage of (3) compared to (5) is that the fast iterative shrinkage-thresholding algorithm
(FISTA) [15] applies to it directly. FISTA applies to problems of the form $E(u) = f(u) + g(u)$
where:

• $f$ is continuously differentiable and convex, and its gradient is Lipschitz, that is
$\|\nabla f(u_1) - \nabla f(u_2)\| \le L(f)\|u_1 - u_2\|$. Here $L(f) > 0$ is the Lipschitz constant of $\nabla f$.

• $g$ is continuous, convex and possibly non-smooth.

The problem is assumed to have a solution $E^* = E(u^*)$. In our case $f(u) = \alpha |z_t| g(u)$, where $g$ on
the right-hand side is the modulation function of the model, and the non-smooth term is $h(u)$, which
satisfies these assumptions ($A$ is initialized with nonnegative entries, which
stay nonnegative during the algorithm without a need to force it). The algorithm converges with the
bound $E(u_k) - E(u^*) \le 2aL(f)\|u_0 - u^*\|^2/(k+1)^2$, where $u_k$ is the value of $u$ at the $k$-th
iteration and $a$ is a constant. The cost of each iteration is $O(mn)$, where $n$ is the input size and $m$ is
the output size. More precisely, the cost is one matrix multiplication by $A$ and one by $A^T$, plus $O(m + n)$
cost. We used the backtracking version of the algorithm to find $L$, which contains a fixed number
of $O(mn)$ operations (independent of the desired error). It is standard knowledge, and easy to see, that
the algorithm applies to the sparse coding (1) as well.
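
For reference, here is a generic constant-step FISTA loop in the form $E(u) = f(u) + g(u)$. This is a sketch of the standard algorithm of [15], not the authors' implementation: the callable signatures `grad_f(u)` and `prox_g(v, step)` are assumptions, and the usage example applies it to an L1-regularized least-squares problem as a stand-in for the energies in this paper.

```python
import numpy as np

def fista(grad_f, prox_g, L, u0, n_iter=200):
    # grad_f(u): gradient of the smooth part f.
    # prox_g(v, step): proximal operator of step * g evaluated at v.
    # L: Lipschitz constant of grad_f (constant step size 1 / L).
    u_prev = u0.copy()
    y = u0.copy()
    t = 1.0
    for _ in range(n_iter):
        u = prox_g(y - grad_f(y) / L, 1.0 / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Extrapolate along the last displacement (momentum).
        y = u + ((t - 1.0) / t_next) * (u - u_prev)
        u_prev, t = u, t_next
    return u_prev

# Hypothetical stand-in problem: f(u) = 0.5 * ||B u - c||^2, g(u) = beta * |u|_1.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
c = rng.standard_normal(20)
beta = 0.3
L = np.linalg.norm(B, 2) ** 2
grad_f = lambda u: B.T @ (B @ u - c)
prox_g = lambda v, step: np.sign(v) * np.maximum(np.abs(v) - beta * step, 0.0)
u_hat = fista(grad_f, prox_g, L, np.zeros(5), n_iter=200)
```

Each iteration costs one multiplication by the matrix and one by its transpose, matching the per-iteration cost stated above.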

3.2 Results 

The input to the network was prepared as follows. We converted all the images of the Berkeley
data set into gray-scale images. We locally removed the mean for each pixel by subtracting a
Gaussian-weighted average of the nearby pixels. The width of the Gaussian was 9 pixels. Then, we
locally normalized the contrast by dividing each pixel by the Gaussian-weighted standard deviation of
the nearby pixels (with a small cutoff to prevent blow-ups). The width of the Gaussian was also 9
pixels. Then, we picked a 20 x 20 window in the image and, for a randomly chosen direction and
magnitude, we moved it for $n_t = 3$ frames and extracted them. The magnitude of the displacement
was random in the range of 1 to 2 pixels. A very large collection of such triplets of frames was
extracted. We trained the sparse coding algorithm with 400 code units in $z$ on each individual frame
(not on the $n_t$ concatenated frames). After training we found the sparse codes for each frame. There
were 100 units in the layer of invariant units $u$. For a larger system with 1000 simple units and 400
invariant units, see the supplementary material.
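
The local mean removal and contrast normalization described above might look like the following NumPy-only sketch. Several details are assumptions: the stated width of 9 pixels is interpreted here as the Gaussian standard deviation, borders are handled by reflection, the cutoff value `eps` is arbitrary, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1.
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable Gaussian smoothing with reflected borders (output has input shape).
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, rows)

def local_normalize(img, sigma=9.0, eps=1e-2):
    # Subtract the Gaussian-weighted local mean, then divide by the
    # Gaussian-weighted local standard deviation, floored at eps.
    centered = img - gaussian_blur(img, sigma)
    local_std = np.sqrt(gaussian_blur(centered ** 2, sigma))
    return centered / np.maximum(local_std, eps)

# Hypothetical gray-scale image with values in [0, 1).
image = np.random.default_rng(0).random((64, 64))
normalized = local_normalize(image)
```

The floor `eps` plays the role of the small cutoff mentioned in the text: in flat image regions the local standard deviation is near zero and dividing by it directly would blow up.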




Figure 2: Connectivity between invariant units and simple units. Left: Each row corresponds to an
invariant unit. For a given invariant unit, the filters of the simple units were ordered according to the
size of the weight to the invariant unit. The strongest 9 are plotted. Middle: Each square corresponds
to an invariant unit. 25 out of 100 invariant units were selected at random. The sparse coding filters
were fitted with Gabor functions. The center position of each line is the center of the Gabor filter. The
orientation is the orientation of the sine wave of the Gabor filter. The brightness is proportional to the
strength of the connection between invariant and simple units. For each invariant unit the weights were
normalized for drawing so that the strongest connection is white and zero connections are black.
Right: Each square (circle) corresponds to an invariant unit. Each dot corresponds to a simple unit.
The distance from the center is the frequency of the Gabor fit. The angle is twice the orientation of
the Gabor fit (twice because angles related by $\pi$ are equivalent). We see that invariant units typically
learn to group together units with similar orientation and frequency. There are a few other types of
filters as well. The units in the middle and right panels correspond to each other and correspond to
the units in the left panel, reading panels left to right and then down. See the supplementary material
for all the filters of the system: 20 x 20 input patches, 1000 simple units, 400 invariant units.

The results are shown in Figure 2; see the caption for a description. We see that many invariant
cells learn to group together filters of similar orientation and frequency but at several positions, and
thus learn invariance with respect to translations. However, there are other types of filters as well.
Remember that the algorithm learns statistical co-occurrence between features, whether in time or
in space.

Figure 3: Responses of simple units and invariant units to the input edge from equations (6)-(8). The left
panel shows responses of simple units trained with sparsity $\alpha = 0.5$ in (1). The right four panels
show responses of invariant units trained with sparsities $\beta = 0.5, 0.3, 0.2, 0.1$ in (3) on the values of
simple units. The x-axis of each panel is the distance of the edge from the center of the image, the
$b$ in (7). The y-axis is the orientation of each edge, the $\theta$ in (8). 30 cells were chosen at random
in each panel. Different colors correspond to different cells. The color intensity is proportional to
the response of the unit. We see that sparse coding units respond to a small range of orientations
and positions (the elongated shape is due to the fact that an edge of orientation somewhat different
from the edge detector orientation sweeps the detector at different positions $b$). On the other hand,
invariant cells respond to edges in a similar range of orientations but a larger range of positions. At
high sparsities the response boundaries are sharp and response regions do not overlap. As we lower
the sparsity the boundaries become more blurry and regions start to overlap. $k = 1$ was used in (8).
Other frequencies produced a similar effect.

The values of the weights give us important information about the properties of the system. However,
ultimately we are interested in how the system responds to an input. We study the response of these
units to a commonly occurring input, an edge. Specifically, the inputs are given by the following
function.

$$X(x, y) \tag{6}$$

$$v = \mathbf{k} \cdot \mathbf{r} + k b \tag{7}$$

$$\mathbf{k} = k(\cos\theta, \sin\theta) \quad \text{and} \quad \mathbf{r} = (x, y) \tag{8}$$

where $(x, y)$ is the position of a pixel from the center of a patch, $b$ is a real number specifying the distance
of the edge from the center, and $\theta$ is the orientation of the edge from the x axis. This is not an edge
function, but a function obtained on an edge after local mean subtraction.

The responses of the simple units and the invariant units are shown in Figure 3; see the caption for a
description. As expected, the sparse coding units respond to edges in a narrow range of positions and a
relatively narrow range of orientations. Invariant cells, on the other hand, are able to pool different
sparse coding units together and become invariant to a larger range of positions. Thus the invariant
units do indeed have the desired invariance properties.

Note that for large sparsities the regions have clear boundaries and are quite sharp. This is similar
to the standard implementation of a convolutional net, where the pooling regions are squares (with
clear boundaries). It is probably preferable to have regions that overlap, as happens at lower
sparsities, since one would prefer smoother responses rather than jumps across boundaries.

4 Theoretical analysis of the full model and its multi-layer generalization

In this section we return to the full model (5). We generalize it to an $n$-layer system, give an inference
algorithm, and outline the proof that under certain assumptions of convexity the algorithm has the
fast $1/k^2$ convergence of FISTA, where $k$ is the iteration number.

The basic idea of minimizing over $z$, $u$ in (5) is to alternate between taking an energy-lowering step
in $z$ while fixing $u$ and taking an energy-lowering step in $u$ while fixing $z$. Note that both of the
restricted problems (the problem in $z$ fixing $u$ and the problem in $u$ fixing $z$) satisfy the conditions of the
FISTA algorithm. This will allow us to take steps of appropriate size that are guaranteed to lower
the total energy. Before that, however, we generalize the problem, which will reveal its structure and
which does not introduce any additional complexity.

Consider a system consisting of $n$ layers with units $z_a$ in the $a$-th layer, with $a = 1, \ldots, n$. We define
$z = (z_1, \ldots, z_n)$, that is, all the vectors $z_a$ concatenated. We define two sets of functions. Let
$e_a^\beta(z_a)$ be continuously differentiable, convex and Lipschitz functions. There can be several such
functions per layer, which is denoted by the index $\beta$. Let $g_a^\beta(z_a)$ be continuous and convex functions,
not necessarily smooth. For convenience we define $z_0 = 0$, $z_{n+1} = 0$, $g_0 = 1$ and $e_{n+1} = 1$.

We define the energy of the system to be

$$E = \sum_{a=0}^{n} \sum_\beta g_a^\beta(z_a)\, e_{a+1}^\beta(z_{a+1}) = \sum_{a=0}^{n} g_a(z_a)\, e_{a+1}(z_{a+1}) \tag{9}$$

where in the second equality we drop the $\beta$ from the notation for simplicity. We will omit writing
the $\beta$ for the rest of the paper. The equation (5) is a special case of (9) with $n = 2$, $g_0 = 1$,
$e_1(z_1) = (1/2)\|x - W z_1\|^2$, $g_1(z_1) = |z_1|$, $e_2(z_2) = \alpha(1 + e^{-A z_2})/2$, $g_2(z_2) = \beta|z_2|$ with
$z_2 \ge 0$, and $e_3 = 1$.

Now observe that given $a \in \{1, \ldots, n\}$, the problem in $z_a$ keeping the other variables fixed satisfies the
conditions of the FISTA algorithm. We can define a step in the variable $z_a$ to be (in analogy to [15],
eq. (2.6)):

$$p_{L_a}(z_a) = \arg\min_{z'_a} \left\{ g_a(z'_a)\, e_{a+1}(z_{a+1}) + \frac{L_a}{2} \left\| z'_a - \left( z_a - \frac{1}{L_a} g_{a-1}(z_{a-1}) \nabla e_a(z_a) \right) \right\|^2 \right\}$$

$$= \mathrm{sh}_{e_{a+1}(z_{a+1})/L_a}\left( z_a - \frac{1}{L_a} g_{a-1}(z_{a-1}) \nabla e_a(z_a) \right) \tag{10}$$

where the latter equality holds if $g_a(z_a) = |z_a|$. Here $\mathrm{sh}$ is the shrinkage function
$\mathrm{sh}_\alpha(z) = \mathrm{sign}(z)(|z| - \alpha)_+$. In the case where $z_a$ is restricted to be nonnegative we need to use
$(z - \alpha)_+$ instead of the shrinkage function.
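
The shrinkage function and its nonnegative variant are one-liners. The sketch below also wraps them into a single layer step in the spirit of (10); the helper name `layer_step` and its argument names are assumptions introduced only for this illustration.

```python
import numpy as np

def shrink(z, alpha):
    # sh_alpha(z) = sign(z) * (|z| - alpha)_+ , applied elementwise.
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def shrink_nonneg(z, alpha):
    # Variant for variables restricted to be nonnegative: (z - alpha)_+
    return np.maximum(z - alpha, 0.0)

def layer_step(z_a, weighted_grad, weight_above, L_a, nonneg=False):
    # One step of the form (10) when g_a is the absolute value.
    # weighted_grad: g_{a-1}(z_{a-1}) * grad e_a(z_a)
    # weight_above:  e_{a+1}(z_{a+1}), the per-unit sparsity coefficient
    v = z_a - weighted_grad / L_a
    op = shrink_nonneg if nonneg else shrink
    return op(v, weight_above / L_a)

example = shrink(np.array([-2.0, -0.5, 0.5, 2.0]), 1.0)
```

Note that the threshold `weight_above / L_a` may differ per unit, which is how the layer above modulates the sparsity of the layer below.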

Let us describe the algorithm for minimizing (9) with respect to $z$ (we will write it explicitly below).
In the order from $a = 1$ to $a = n$, take the step $z'_a = p_{L_a}(z_a)$ in (10). Repeat until the desired accuracy is reached.
The $L_a$'s have to be chosen so that

$$g_{a-1}(z_{a-1})\, e_a(z'_a) \le g_{a-1}(z_{a-1})\, e_a(z_a) + \langle z'_a - z_a,\; g_{a-1}(z_{a-1}) \nabla e_a(z_a) \rangle + (L_a/2)|z'_a - z_a|^2 \tag{11}$$

This can be assured by taking $L_a \ge L(g_{a-1} e_a)$, where the latter $L$ denotes the Lipschitz constant
of its argument. Otherwise, as used in our simulations, it can be found by backtracking, see below.
This will assure that each step lowers the total energy (9) and hence the overall procedure will keep
lowering it. In fact the step $p_L$ with such a chosen $L$ is in some sense a step with ideal step size. Let
us now write the algorithm explicitly:

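Hierarchical (F)ISTA.

Step 0. Take $L_a^0 > 0$, some $\eta > 1$ and initial real vectors $z_a^0$. Set $\hat z_a^1 = z_a^0$, $t_1 = 1$, $a = 1, \ldots, n$.

Step k. ($k \ge 1$).

Loop a = 1:n { Backtracking {

Find the smallest nonnegative integer $i_a$ such that with $\bar L = \eta^{i_a} L_a^{k-1}$

$$g_{a-1}(z_{a-1}^k)\, e_a(p_{\bar L}(\hat z_a^k)) \le g_{a-1}(z_{a-1}^k)\, e_a(\hat z_a^k) + \tag{12}$$

$$\langle p_{\bar L}(\hat z_a^k) - \hat z_a^k,\; g_{a-1}(z_{a-1}^k) \nabla e_a(\hat z_a^k) \rangle + \frac{\bar L}{2} |p_{\bar L}(\hat z_a^k) - \hat z_a^k|^2 \tag{13}$$

Set $L_a^k = \eta^{i_a} L_a^{k-1}$ }

Compute

$$z_a^k = p_{L_a^k}(\hat z_a^k) \;\} \tag{14}$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \qquad r_k = \frac{t_k - 1}{t_{k+1}} \tag{15}$$

Loop a = 1:n { $\hat z_a^{k+1} = z_a^k + r_k (z_a^k - z_a^{k-1})$ } (16)

The algorithm described above is this algorithm with the choice $r_k = 0$ in the second-to-last line.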
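
To make the alternating scheme concrete, here is a sketch of the momentum-free variant ($r_k = 0$) for the two-layer energy (5) with a single frame ($n_t = 1$). It is an illustration under assumptions, not the authors' code: the $z$ step uses a fixed Lipschitz constant instead of backtracking, the $u$ step uses backtracking with a small numerical tolerance, and all function names, sizes and parameter values are hypothetical.

```python
import numpy as np

def g(A, u):
    return 0.5 * (1.0 + np.exp(-(A @ u)))

def energy(W, A, x, z, u, alpha, beta):
    # Energy (5) for one frame; u is kept nonnegative, so |u|_1 = sum(u).
    r = x - W @ z
    return 0.5 * r @ r + alpha * np.abs(z) @ g(A, u) + beta * u.sum()

def z_step(W, A, x, z, u, alpha, L_z):
    # Gradient step on the quadratic term, then per-unit shrinkage with
    # thresholds alpha * g(u)_i / L_z set by the layer above.
    v = z - W.T @ (W @ z - x) / L_z
    thresh = alpha * g(A, u) / L_z
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def u_step(A, z, u, alpha, beta, L, eta=2.0):
    # Smooth part f(u) = sum_i alpha |z_i| g(u)_i ; non-smooth part beta * u, u >= 0.
    c = alpha * np.abs(z)
    f = lambda v: c @ g(A, v)
    grad = -0.5 * A.T @ (c * np.exp(-(A @ u)))
    while True:  # backtracking: increase L until the quadratic bound holds
        u_new = np.maximum(u - (grad + beta) / L, 0.0)
        d = u_new - u
        if f(u_new) <= f(u) + d @ grad + 0.5 * L * (d @ d) + 1e-12:
            return u_new, L
        L *= eta

def infer(W, A, x, alpha, beta, n_iter=50):
    z, u = np.zeros(W.shape[1]), np.zeros(A.shape[1])
    L_z, L_u = np.linalg.norm(W, 2) ** 2, 1.0
    energies = []
    for _ in range(n_iter):
        z = z_step(W, A, x, z, u, alpha, L_z)           # layer a = 1
        u, L_u = u_step(A, z, u, alpha, beta, L_u)      # layer a = 2
        energies.append(energy(W, A, x, z, u, alpha, beta))
    return z, u, energies

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)); W /= np.linalg.norm(W, axis=0)
A = np.abs(rng.standard_normal((32, 8))); A /= np.linalg.norm(A, axis=0)
x = rng.standard_normal(16)
z, u, energies = infer(W, A, x, alpha=0.5, beta=0.3)
```

Since each layer step satisfies a bound of the form (11), the recorded energies should form a non-increasing sequence, which is the property the $r_k = 0$ choice guarantees.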

Let us discuss the $r_k$'s. For a single-layer system ($n = 1$) the $r_k = 0$ choice is called ISTA and has a
convergence bound of $E(z_k) - E(z^*) \le aL\|z_0 - z^*\|^2/(2k)$. The other choice of $r_k$ is the FISTA
algorithm. The convergence rate of FISTA is much better than that of ISTA, having $k^2$ in the
denominator.

For a hierarchical system, the choice $r_k = 0$ guarantees that each step lowers the energy. The question
is whether introducing the other choice of $r_k$ would speed up convergence to the FISTA convergence
rate. The trouble is that in general the product $g_a e_{a+1}$ is non-convex, which is the case for (5). For
example, we can readily see that if the function has more than one local minimum, this convergence
would certainly not be guaranteed (imagine starting at a non-minimal point with zero derivative).
The effect of $r_k$ is that of momentum, and this momentum increases towards one as $k$ increases.
With such a large momentum the system is in danger of "running around". It might be effective to
introduce this momentum but to regularize it (say, bound it by a number smaller than one). In any
case one can always use the algorithm with $r_k = 0$.
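
The growth of the momentum can be checked numerically. The sketch below uses the standard FISTA recursion from [15], $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$ with momentum $(t_k - 1)/t_{k+1}$, and a clipped variant corresponding to the regularization suggested above; the function names and the bound `r_max = 0.9` are illustrative assumptions.

```python
import math

def fista_momentum(n_steps):
    # Momentum coefficients r_1, ..., r_n of standard FISTA, starting from t_1 = 1.
    t = 1.0
    coefficients = []
    for _ in range(n_steps):
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        coefficients.append((t - 1.0) / t_next)
        t = t_next
    return coefficients

def clipped_momentum(n_steps, r_max=0.9):
    # Regularized variant: bound the momentum by a number smaller than one.
    return [min(r, r_max) for r in fista_momentum(n_steps)]

r = fista_momentum(100)
r_clipped = clipped_momentum(100)
```

The first coefficient is zero (the first iteration is a plain ISTA step) and later coefficients approach one, whereas the clipped sequence saturates at `r_max`.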

In the special case when all the products $g_a e_{a+1}$ are convex, however, we give an outline of the proof that the
algorithm converges with the FISTA rate of convergence. For this purpose we define the full step in
$z$, $P_L(z)$, to be the result of the sequence of steps $z'_a = p_{L_a}(z_a)$, eq. (10), from $a = 1$ to $a = n$. That
is, we have $P_L: (z_1, z_2, \ldots, z_n) \to (z'_1, z_2, \ldots, z_n) \to (z'_1, z'_2, \ldots, z_n) \to \cdots \to (z'_1, z'_2, \ldots, z'_n)$. We
assume that all the $L_a$'s are the same (this is always possible by making all $L_a$'s equal to the largest
value).

The core of the proof is to show Lemma 2.3 of [15]:

Lemma 2.3: Assume that $\forall a \in \{0, \ldots, n\}$, $e_a$ is continuously differentiable and Lipschitz, $g_a$ is
continuous, $g_a e_{a+1}$ is convex, and $P_L$ is defined by the sequence of $p_{L_a}$'s of (10) as described above.
Then for any $z$, $\bar z$

$$E(\bar z) - E(P_L(z)) \ge \frac{L}{2}|P_L(z) - z|^2 + L\langle z - \bar z,\; P_L(z) - z \rangle \tag{17}$$

The proof in [15] shows that if the algorithm consists of applying the sequence of $P_L$'s and these
$P_L$'s satisfy Lemma 2.3, then the algorithm converges with rate
$E(z_k) - E(z^*) \le 2L_{\max}|z_0 - z^*|^2/(k+1)^2$. Thus we need to prove Lemma 2.3. We start with the analog of Lemma 2.2 of [15].

Lemma 2.2: For any $z$, one has $z'_a = p_{L_a}(z_a)$ if and only if there exists $\gamma_a(z_a) \in \partial g_a(z'_a)$, the
subdifferential of $g_a(\cdot)$, such that

$$g_{a-1}(z_{a-1}) \nabla e_a(z_a) + L(z'_a - z_a) + e_{a+1}(z_{a+1})\gamma_a(z_a) = 0 \tag{18}$$

This lemma follows trivially from the definition of $p_{L_a}(z_a)$, as in [15].

Proof of Lemma 2.3: Define $z' = P_L(z)$. From convexity we have

$$g_a(\bar z_a)\, e_{a+1}(\bar z_{a+1}) \ge g_a(z'_a)\, e_{a+1}(z_{a+1}) + \langle \bar z_a - z'_a,\; e_{a+1}(z_{a+1})\gamma_a(z_a) \rangle + \langle \bar z_{a+1} - z_{a+1},\; g_a(z'_a) \nabla e_{a+1}(z_{a+1}) \rangle \tag{19}$$

Next we have the property (11). However, the $z_{a-1}$ should be primed ($z'_{a-1}$) because the $z_{a-1}$ has
already been updated. Due to space limitations we will not write out all the calculations but specify
the sequence of operations. The details are written out in the supplementary material. We take the
first term on the left side of (17), $E(\bar z)$, and express it in terms of (9). Then we replace the terms using
the convexity inequalities and substitute the $\gamma_a(z_a)$'s using Lemma 2.2. Then we take the second term
on the left side of (17), $E(P_L(z))$, again express it using (9), and use the inequalities (11). Putting
it all together, all the gradient terms cancel and the other terms combine to give Lemma 2.3. This
completes the proof.
5 Conclusions

We introduced a simple and efficient algorithm for learning invariant representations from unlabeled
data. The method takes advantage of temporal consistency in sequential image data. In the future
we plan to use the invariant features discovered by the method in hierarchical vision architectures,
and apply them to recognition problems.
A Supplementary material for Efficient Learning of Sparse Invariant
Representations

1) We give details of the proof of Lemma 2.3.

2) We show all of the invariant filters for a system with 20x20 input patches, 1000 simple units and 400
invariant units.

A.1 Lemma 2.3

Lemma 2.3: Assume that $\forall a \in \{0, \ldots, n\}$, $e_a$ is continuously differentiable and Lipschitz, $g_a$ is continuous,
$g_a e_{a+1}$ is convex, and $P_L$ is defined by the sequence of $p_{L_a}$'s in the paper. Then for any $z$, $\bar z$

$$E(\bar z) - E(P_L(z)) \ge \frac{L}{2}|P_L(z) - z|^2 + L\langle z - \bar z,\; P_L(z) - z \rangle \tag{20}$$



Proof of Lemma 2.3: Define $z' = P_L(z)$. We first collect the inequalities that we will need.
From convexity we have

$$g_a(\bar z_a)\, e_{a+1}(\bar z_{a+1}) \ge g_a(z'_a)\, e_{a+1}(z_{a+1}) + \langle \bar z_a - z'_a,\; e_{a+1}(z_{a+1})\gamma_a(z_a) \rangle + \langle \bar z_{a+1} - z_{a+1},\; g_a(z'_a) \nabla e_{a+1}(z_{a+1}) \rangle \tag{21}$$

Next we have the property of the step $p_{L_a}$ that guarantees that the energy is lowered in each step.

$$g_{a-1}(z'_{a-1})\, e_a(z'_a) \le g_{a-1}(z'_{a-1})\, e_a(z_a) + \langle z'_a - z_a,\; g_{a-1}(z'_{a-1}) \nabla e_a(z_a) \rangle + (L/2)|z'_a - z_a|^2 \tag{22}$$

Finally we have Lemma 2.2

$$g_{a-1}(z'_{a-1}) \nabla e_a(z_a) + L(z'_a - z_a) + e_{a+1}(z_{a+1})\gamma_a(z_a) = 0 \tag{23}$$

Now we can put these equations together. The steps are: Write out the left side of (20) in terms of the definition
of $E$. Use inequalities (21) and (22). Eliminate the $\gamma$'s using (23). Simplify. Here are the details:

$$\begin{aligned}
E(\bar z) - E(z') &= \sum_a \Big[ g_a(\bar z_a)\, e_{a+1}(\bar z_{a+1}) - g_a(z'_a)\, e_{a+1}(z'_{a+1}) \Big] \\
&\ge \sum_a \Big[ g_a(z'_a)\, e_{a+1}(z_{a+1}) + \langle \bar z_a - z'_a,\; e_{a+1}(z_{a+1})\gamma_a(z_a) \rangle + \langle \bar z_{a+1} - z_{a+1},\; g_a(z'_a) \nabla e_{a+1}(z_{a+1}) \rangle \\
&\qquad - g_{a-1}(z'_{a-1})\, e_a(z_a) - \langle z'_a - z_a,\; g_{a-1}(z'_{a-1}) \nabla e_a(z_a) \rangle - (L/2)|z'_a - z_a|^2 \Big] \\
&= \sum_a \Big[ g_a(z'_a)\, e_{a+1}(z_{a+1}) - L\langle \bar z_a - z'_a,\; z'_a - z_a \rangle - \langle \bar z_a - z'_a,\; g_{a-1}(z'_{a-1}) \nabla e_a(z_a) \rangle \\
&\qquad + \langle \bar z_{a+1} - z_{a+1},\; g_a(z'_a) \nabla e_{a+1}(z_{a+1}) \rangle \\
&\qquad - g_{a-1}(z'_{a-1})\, e_a(z_a) - \langle z'_a - z_a,\; g_{a-1}(z'_{a-1}) \nabla e_a(z_a) \rangle - (L/2)|z'_a - z_a|^2 \Big] \\
&= \sum_a \Big[ -L\langle \bar z_a - z'_a,\; z'_a - z_a \rangle - (L/2)|z'_a - z_a|^2 \Big] \\
&= -L\langle \bar z - z',\; z' - z \rangle - (L/2)|z' - z|^2 \\
&= (L/2)|z' - z|^2 + L\langle z - \bar z,\; z' - z \rangle
\end{aligned} \tag{24}$$

which is the formula (20). Note that in line 5 and in the first term of line 6 we shifted $a$ by one. This
completes the proof.



A.2 Simple unit and invariant unit filters, $\alpha = 0.5$, $\beta = 0.3$

Figure 4: Sparse coding filters. Inputs: 20x20 image patches, preprocessed. Code: 1000 simple
units.

Figure 5: Selection of invariant units. See Figure 2a of the paper for explanation. System: 20x20
patches, 1000 simple units, 400 invariant units.

Figure 6: All 400 invariant units. See Figure 2b of the paper for explanation. System: 20x20 patches,
1000 simple units, 400 invariant units.

Figure 7: All 400 invariant units. Position of the invariant cells (centers of Gabor fits). System:
20x20 patches, 1000 simple units, 400 invariant units.

Figure 8: All 400 invariant units. See Figure 2c of the paper for explanation. System: 20x20 patches,
1000 simple units, 400 invariant units.